5 Findings

5.1 Word Frequencies

In our sample, the longest piece of classic literature is Aurora Leigh, at 89,224 words, while the shortest is The Three Little Pigs, at 1,109, meaning our longest piece is roughly 80 times the length of our shortest.

On the fanfiction side, our longest is The Marks we Make, at 270,559 words, while the shortest is That Awkward Moment When Your Whole Class Shows up At Your House, at 28,918, just over a tenth the length of our longest. It does make up for it by having the longest title, however.
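The two spreads can be recomputed directly from the word counts quoted above. This is only a quick arithmetic check; the variable names are ours and not part of the original analysis.

```python
# Word counts as quoted in the text.
classic_longest = 89_224    # Aurora Leigh
classic_shortest = 1_109    # The Three Little Pigs
fanfic_longest = 270_559    # The Marks we Make
fanfic_shortest = 28_918    # shortest fanfiction in the sample

# How many times longer the longest piece is than the shortest.
classic_ratio = classic_longest / classic_shortest
fanfic_ratio = fanfic_longest / fanfic_shortest

print(f"classic: longest is {classic_ratio:.1f}x the shortest")
print(f"fanfic:  longest is {fanfic_ratio:.1f}x the shortest")
```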

Although the extremes of our classic literature sample are proportionally much more spread out, they’re actually distributed much more evenly, as can be seen from the squat inner quartiles in the graph above.

There are certainly examples outside our sample of one group far surpassing the other in either direction: War and Peace is nearly 600,000 words, while the entire genre colloquially known as “CrackFic” is premised around being under 100.

Figure 5.1: Top 100 words for (Left) classic literature; (Right) modern fanfiction

We see here that the books use a wider variety of pronouns, each at roughly the same frequency, whereas the fanfictions tend to use ‘he’, ‘his’, and ‘him’ more.

Figure 5.2: Top 100 words which solely exist in (Left) classic literature; (Right) modern fanfiction
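As a rough illustration of how word lists like those in Figures 5.1 and 5.2 could be produced, the sketch below counts words in each corpus and then keeps only the words that never appear in the other. The function names, the simple regex tokenizer, and the two toy sentences are all assumptions for illustration; they are not the pipeline used for the figures.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count simple alphabetic word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def top_words(counts, n=100):
    """The n most frequent words, most frequent first."""
    return [word for word, _ in counts.most_common(n)]

def exclusive_top_words(counts, other_counts, n=100):
    """The n most frequent words in `counts` that never occur in `other_counts`."""
    only_here = Counter({w: c for w, c in counts.items() if w not in other_counts})
    return [word for word, _ in only_here.most_common(n)]

# Toy stand-ins for the two corpora.
classic = word_counts("She gave him her hand and he took it")
fanfic = word_counts("He gave his phone to him and he left")

print(top_words(fanfic, n=3))
print(exclusive_top_words(classic, fanfic))
print(exclusive_top_words(fanfic, classic))
```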

5.2 Punctuation

Figure 5.3: Punctuation frequency

It’s worth noting that the method we used for tokenizing punctuation results in a lot of simplification: colons, semi-colons, and ellipses are all classified as ‘:’. Exclamation marks and periods are considered the same. Despite this, there’s still a bit that can be gleaned from this graphic: classic literature has a higher proportion of commas and partial sentence breaks, which might point towards more compound sentences, and the hashtag only appears in our fanfiction sample.
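To make the simplification concrete, here is a minimal sketch of a punctuation counter that collapses marks in the way described above. The collapse table and the regular expression are assumptions reconstructed from the description alone; the actual tokenizer is not shown in this report and may treat other marks (question marks, for instance) differently.

```python
import re
from collections import Counter

# Assumed collapse table: semi-colons and ellipses count as ':',
# exclamation marks count as '.'. Everything else is kept as-is.
COLLAPSE = {";": ":", "...": ":", "…": ":", "!": "."}

# Match a three-dot ellipsis first so it is not counted as three periods.
PUNCT = re.compile(r"\.{3}|[.,:;!?#…]")

def punctuation_counts(text):
    """Count punctuation marks after collapsing them into simplified classes."""
    counts = Counter()
    for mark in PUNCT.findall(text):
        counts[COLLAPSE.get(mark, mark)] += 1
    return counts

sample = "Wait... no; stop! Fine: go. #tag, ok?"
counts = punctuation_counts(sample)
print(counts)
```

With this scheme a text's share of ‘:’ mixes three different marks, which is why the graphic can only hint at "partial sentence breaks" in general.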

5.3 Dialogue

Figure 5.4: Average length-to-segment ratio of a single-person dialogue sequence

Figure 5.5: Average length of continuous conversation

Figure 5.6: Variance of the length of continuous conversation

5.4 Part of Speech

Next, we’re going to look at parts of speech (POS). The natural language processing method we used recognizes ~34 ‘parts of speech’, with most of the ‘extra’ POS in the package being more specific applications of the main 8 English POS. For example, ‘big’, ‘bigger’, and ‘biggest’ are all adjectives, but the package would categorize them as simple, comparative, and superlative adjectives.
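The relationship between the fine-grained tags and the main parts of speech can be sketched as a lookup table. The tag names below follow the Penn Treebank convention (JJ, JJR, JJS, and so on), which is an assumption on our part since the package is not named here; the tagged words are hand-labelled for illustration rather than produced by a tagger.

```python
# Assumed Penn Treebank-style tags; the package used in this report is not named.
FINE_TO_COARSE = {
    "JJ": ("adjective", "simple"),
    "JJR": ("adjective", "comparative"),
    "JJS": ("adjective", "superlative"),
    "RB": ("adverb", "simple"),
    "RBR": ("adverb", "comparative"),
    "RBS": ("adverb", "superlative"),
}

def coarse_label(tag):
    """Map a fine-grained tag to (main part of speech, degree)."""
    return FINE_TO_COARSE.get(tag, ("other", tag))

# Hand-tagged illustration, not output from a real tagger.
tagged = [("big", "JJ"), ("bigger", "JJR"), ("biggest", "JJS")]

for word, tag in tagged:
    family, degree = coarse_label(tag)
    print(f"{word}: {degree} {family}")
```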

We’re going to look at what proportion of the text of each story the 6 most common parts of speech make up. These are singular nouns, pronouns, prepositions, determiners (the, a, that), simple adjectives, and simple adverbs.
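A per-story proportion of this kind could be computed as below. This is a sketch under assumptions: the tags (NN, PRP, IN, DT, JJ, RB) are Penn Treebank-style stand-ins for the six categories listed, the function name is hypothetical, and the example sentence is tagged by hand.

```python
from collections import Counter

# Assumed tags for: singular noun, pronoun, preposition, determiner,
# simple adjective, simple adverb.
TOP_SIX = ["NN", "PRP", "IN", "DT", "JJ", "RB"]

def pos_proportions(tagged_tokens, tags=TOP_SIX):
    """Share of all tokens in a story that carry each of the given tags."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {tag: 0.0 for tag in tags}
    return {tag: counts[tag] / total for tag in tags}

# Hand-tagged toy "story".
story = [
    ("the", "DT"), ("wolf", "NN"), ("blew", "VBD"), ("on", "IN"),
    ("the", "DT"), ("little", "JJ"), ("house", "NN"),
]

proportions = pos_proportions(story)
for tag, share in proportions.items():
    print(f"{tag}: {share:.2f}")
```

Dividing by the total token count is what makes stories of very different lengths comparable on one chart.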

Figure 5.7: Top 6 Part of Speech Comparison

What really stands out here is, again, that there’s a lot of variation in our classical literature that we’re not seeing in our fanfiction sample. However, every part of speech has at least one story from our sample that is a statistical outlier compared to the others. Also noteworthy is the proportionally smaller number of singular nouns used in fanfiction.

When we compared these parts of speech against the overall word count, books had a higher proportion of the major categories. It is interesting to note that books used more nouns and pronouns than fanfictions.

5.5 Sentence Structure

Another aspect we looked at was the number of unique sentence structures (SS) in each story. As it turns out, there are MANY ways to put together a sentence in English, so in this case we used an (extra) simplified way of structuring a sentence: nouns (including pronouns), verbs, adjectives, and adverbs only, no punctuation. Despite this, we still had stories with nearly (but not over) 9,000 unique sentence structures.
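The simplification described above can be sketched as follows: each tagged sentence is reduced to its sequence of noun, verb, adjective, and adverb labels, and the number of distinct sequences is counted. The tag prefixes are Penn Treebank-style assumptions (treating every PRP tag as a pronoun is also our guess), and the sentences are hand-tagged toy examples.

```python
def structure(tagged_sentence):
    """Reduce a tagged sentence to a tuple of N / V / ADJ / ADV labels.

    Punctuation and all other parts of speech are dropped.
    """
    labels = []
    for _, tag in tagged_sentence:
        if tag.startswith("NN") or tag.startswith("PRP"):
            labels.append("N")       # nouns, including pronouns
        elif tag.startswith("VB"):
            labels.append("V")
        elif tag.startswith("JJ"):
            labels.append("ADJ")
        elif tag.startswith("RB"):
            labels.append("ADV")
    return tuple(labels)

def unique_structures(tagged_sentences):
    """Set of distinct simplified structures in a story."""
    return {structure(sentence) for sentence in tagged_sentences}

# Hand-tagged toy story of three sentences.
story = [
    [("He", "PRP"), ("ran", "VBD"), ("quickly", "RB"), (".", ".")],
    [("The", "DT"), ("wolf", "NN"), ("huffed", "VBD"), ("loudly", "RB"), ("!", ".")],
    [("The", "DT"), ("little", "JJ"), ("pig", "NN"), ("laughed", "VBD"), (".", ".")],
]

print(len(unique_structures(story)))
```

The first two sentences differ in wording and determiners but collapse to the same structure, which shows how aggressive the simplification is.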

There are many types of structures. Homing in on structures that occur more than 50 times, the fanfictions have 23 different sentence structures.

For books, we see that only 6 different structures are used more than 50 times. We also checked whether we were just at the cusp of where books start to use more structures, and found that not many more structures were used more than 30 times. It is reasonable to assume that books have a higher variety of sentence structures, leading to lower counts per structure.
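The threshold check described here amounts to counting how many structures exceed a given number of occurrences. A minimal sketch, with a hypothetical function name and made-up counts purely for illustration:

```python
from collections import Counter

def structures_above(structure_counts, threshold):
    """Number of distinct structures used more than `threshold` times."""
    return sum(1 for count in structure_counts.values() if count > threshold)

# Made-up occurrence counts; these are not figures from our sample.
toy_counts = Counter({"N V": 120, "N V N": 51, "ADJ N V": 50, "N V ADV": 31})

print(structures_above(toy_counts, 50))
print(structures_above(toy_counts, 30))
```

Running the same function at 50 and then at 30 mirrors the "cusp" check: if the count barely grows when the threshold drops, the low count is not an artifact of where the cutoff was placed.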

Intuitively, a longer story will have more unique SS in it, and this seems to pan out in our data. However, the relationship between length and unique SS seems to be much stronger in fanfiction than in classic literature: in fact, our longest classic, Aurora Leigh, has a comparatively small number of SS (372), which falls much closer to The Three Little Pigs (64) than to the next-longest classic, The Secret Garden (2,226). It is worth considering that Aurora Leigh is written in verse, which may impact SS variety.